survey of load balancing

ECMP is easy to deploy but prone to congestion when large flows collide on the same path. A well-designed load balancing mechanism is needed, but it also faces several challenges.
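
A minimal sketch of why this happens, assuming a simple hash-based path choice over a fixed set of equal-cost paths (the hash fields and path count are illustrative):

```python
import hashlib

def ecmp_path(five_tuple, num_paths):
    """Statically hash a flow's five-tuple to one of the equal-cost paths."""
    digest = hashlib.md5(str(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# ECMP is agnostic to flow size: two elephant flows may hash onto the same
# path and congest it while other equal-cost paths stay underutilized.
flow_a = ("10.0.0.1", "10.0.1.1", 40321, 80, "tcp")
flow_b = ("10.0.0.2", "10.0.1.2", 41517, 80, "tcp")
print(ecmp_path(flow_a, 4), ecmp_path(flow_b, 4))
```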

dcn network architecture

there are several network architectures:

  1. fat tree
  2. VL2 (also a Clos-like topology)
  3. CamCube
  4. BCube
  5. DCell

dcn traffic analysis

  1. most traffic stays within the rack
  2. the majority of flows are mice flows, including hellos and meta-data requests.
  3. elephant flows contribute most of the bytes
  4. traffic at the sources is ON/OFF, and the ON/OFF durations fit heavy-tailed distributions. The aggregate traffic is random and hard to predict.
  5. packet losses are mainly caused by microbursts at each layer, and traffic is burstier in the aggregation and edge layers than in the core layer

recent proposals

lb mechanism

lb consists of two procedures (a minimal sketch follows the list below).

  • congestion information collection
    • TCP-based: ECN or cwnd
    • sending rate
    • switch queue length
  • path selection
    • least congested
    • less congested
    • (weighted) round robin
    • integrated allocation
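
As a minimal sketch of how the two procedures fit together (the congestion metric, path names, and values are made up for illustration):

```python
def least_congested(path_queue_len):
    """Selection: pick the path with the smallest collected congestion metric."""
    return min(path_queue_len, key=path_queue_len.get)

# Collection: suppose switch queue lengths (in packets) were gathered per path.
congestion = {"path1": 12, "path2": 3, "path3": 30}
print(least_congested(congestion))   # -> "path2"
```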

centralized mechanisms

centralized mechanisms rely on a central controller with a global view of the topology. The controller collects utilization information and makes scheduling decisions. However, transferring this information induces additional communication overhead, and the coarse time granularity of control is another limitation.

Hedera

Hedera detects large flows, estimates their demands, and assigns them paths centrally to maximize network utilization with minimal scheduler overhead.

A flow is detected as an elephant at the edge switches if its rate exceeds 10% of the link capacity.

Two approaches are proposed to schedule the elephant flows (a rough sketch of the first follows the list); both can run at the 10-100 ms level.

  1. Global First Fit: picks the first path whose available capacity can accommodate the flow.
  2. Simulated Annealing: aggregates elephant flows with the same destination and assigns a core switch to them, to reduce the search space.
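
A rough sketch of the detection threshold and the Global First Fit placement, under simplifying assumptions (a single bottleneck capacity per path; the constants and data structures are illustrative, not the paper's exact algorithm):

```python
LINK_CAPACITY = 10e9                       # assume 10 Gbps links
ELEPHANT_THRESHOLD = 0.1 * LINK_CAPACITY   # 10% of link capacity

def is_elephant(flow_rate_bps):
    return flow_rate_bps > ELEPHANT_THRESHOLD

def global_first_fit(flow_demand_bps, paths, reserved_bps):
    """Place the flow on the first path whose residual capacity fits its demand."""
    for path in paths:
        if reserved_bps[path] + flow_demand_bps <= LINK_CAPACITY:
            reserved_bps[path] += flow_demand_bps
            return path
    return None   # no fit found; leave the flow on its default (ECMP) path

reserved = {"via_core1": 0.0, "via_core2": 0.0}
if is_elephant(2e9):
    print(global_first_fit(2e9, ["via_core1", "via_core2"], reserved))
```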

PortLand routing is used to tolerate link failures.

shortcomings:

  1. not all large flows are detected.
  2. elephant flows are not assigned to proper paths until their rates reach the threshold
  3. although statistics collection and large flow scheduling can be finished quickly, the whole scheduling period in the simulation is 5 seconds

Mahout

Mahout marks elephant flows at the end host by monitoring the socket buffer. When the buffer exceeds a threshold, some packets of the flow are sent to the controller. The marking process runs every $T_{target}$ seconds (i.e. 1 s). Marked flows are transmitted through the least-congested available path, while the other flows use ECMP.
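
A minimal sketch of the end-host marking step, assuming a shim that can read the socket send-buffer occupancy (the threshold, marking action, and controller upcall are illustrative/hypothetical):

```python
THRESHOLD_BYTES = 100 * 1024   # hypothetical elephant threshold

def notify_controller(flow_id):
    print(f"flow {flow_id} reported to the controller as an elephant")

def check_flow(flow_id, send_buffer_bytes, marked):
    """Mark a flow as an elephant once its socket buffer occupancy passes the threshold."""
    if flow_id not in marked and send_buffer_bytes > THRESHOLD_BYTES:
        marked.add(flow_id)        # e.g. tag subsequent packets so switches can match them
        notify_controller(flow_id)

marked = set()
check_flow("f1", 250 * 1024, marked)
```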

Mahout outperforms ECMP and, compared with Hedera, needs only a small overhead to schedule flows since detection works at the end host.

Mahout still works at the millisecond level, and latency-sensitive mice flows could miss their deadlines under ECMP.

FDALB

Short flows are transmitted using ECMP. Long flows are tagged at the end host once their sent bytes exceed a threshold and are then scheduled by greedy round-robin, based on the assumption that under AIMD TCP the flows equally share the maximum bandwidth.
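
One way to read the greedy scheduling step, as a sketch: keep a long-flow count per path and place each newly tagged long flow on the least-loaded path (the names and structures are illustrative, not the paper's exact algorithm):

```python
def schedule_long_flow(flow_id, long_flows_per_path):
    """Greedily place a tagged long flow on the path carrying the fewest long flows."""
    path = min(long_flows_per_path, key=long_flows_per_path.get)
    long_flows_per_path[path] += 1
    return path

paths = {"p1": 0, "p2": 0, "p3": 0}
for f in ["f1", "f2", "f3", "f4"]:
    print(f, "->", schedule_long_flow(f, paths))
```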

Freeway

Freeway divides paths into low-latency paths consisting of low-latency oriented links (LOL) and high-throughput paths consisting of high-throughput oriented links (HOL). The ratio of LOLs changes according to path utilization. Mice flows are transmitted over the LOL paths with ECMP, while elephant flows exchange information with the controller in advance and are scheduled by it.

Simulations show Freeway performs close to the optimal per-packet load balancing mechanism PerPktLB, which itself suffers heavy packet reordering and thus performance degradation. At the same time, Freeway leaves the residual capacity of the low-latency paths unused.

energy aware method

energy-aware methods focus more on how to allocate tasks in a cloud computing data center.

distributed mechanisms

MPTCP

shortcomings:

Short flow completion times are longer, and the gain may be small or limited in some scenarios.

FlowBender

FlowBender works at the end host to avoid modifying switches. It monitors the fraction of ECN-marked packets; when congestion is detected, it changes a packet header field so that switches re-hash the flow onto a different path.

It may need several rerouting attempts to find a proper path.
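
A minimal sketch of the end-host logic, assuming per-RTT ECN accounting (the threshold and the perturbed header field are illustrative):

```python
ECN_FRACTION_THRESHOLD = 0.05   # assumed congestion threshold

def end_of_rtt(acked_pkts, ecn_marked_pkts, hash_field):
    """If the ECN-marked fraction is too high, perturb a header field used by ECMP hashing."""
    fraction = ecn_marked_pkts / max(acked_pkts, 1)
    if fraction > ECN_FRACTION_THRESHOLD:
        hash_field += 1            # switches re-hash the flow onto a different path
    return hash_field

field = 0
field = end_of_rtt(acked_pkts=100, ecn_marked_pkts=20, hash_field=field)
print(field)   # changed -> the flow will take a new path next RTT
```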

Clove

Clove is implemented in software edge switches to avoid hardware and end-host stack modifications. Clove keeps standard ECMP in the physical network and changes the source port in the packet header at the software switches to steer which ECMP path the physical switches choose.

Software switches in Clove send many probes simultaneously while varying the source port. The source hypervisor thereby learns which source port value leads to which path toward the destination.

Clove adopts flowlet switching to avoid packet reordering and detects flowlets at the software edge switches.

Clove balances flowlets in a weighted round-robin fashion. The first variant, Clove-ECN, adopts Explicit Congestion Notification (ECN) to detect congestion and calculate the weights. The second, Clove-INT, collects exact path utilization using In-band Network Telemetry (INT).
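
A sketch of the Clove-ECN flavor under simple assumptions: per-path weights shrink when ECN marks come back on that path, and each new flowlet is spread according to the weights (a weighted random pick stands in for weighted round-robin; the update rule and constants are illustrative):

```python
import random

weights = {"p1": 1.0, "p2": 1.0, "p3": 1.0}   # one weight per uplink/path

def on_ecn_feedback(path, marked):
    """Shrink a path's weight on ECN marks, slowly restore it otherwise."""
    weights[path] = max(0.1, weights[path] * (0.7 if marked else 1.05))

def pick_path_for_flowlet():
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

on_ecn_feedback("p1", marked=True)
print(pick_path_for_flowlet())
```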

Clove-ECN outperforms ECMP, and Clove-INT performs similarly to CONGA.

PEFT

PEFT adjusts the weight allocation periodically according to measurements. The measured matrix is broadcast every 30 seconds or whenever the link state changes.

It outperforms ECMP but adapts slowly.

Conga

see conga.

CONGA still needs several RTTs (10-100 microseconds) to update its congestion information, by which time the congestion may already have disappeared.

when flows arrive at the same time, CONGA may choose the same least congested path for all of them, causing flow collisions.

Expeditus

Expeditus works in Clos-like topologies, collecting only one-hop congestion information.

  • one-hop congestion information collection. Each switch monitors its southbound links (from upstream switches to itself) and northbound links (from itself to upstream switches). Northbound information is collected from its local egress ports; southbound information is collected from incoming packet headers that carry the upstream switch's egress port and rate.
  • two-stage path selection. SYN and SYN-ACK packets are routed using ECMP. On receiving the SYN packet, the destination edge switch determines the destination-side aggregation switch and core switch for the reverse path. On receiving the SYN-ACK packet, the source edge switch determines the source-side aggregation and core switches for the forward path.
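
As a rough sketch of how one-hop information could drive the selection, score each candidate uplink by the worst of the locally known northbound and southbound utilizations and pick the minimum (the data layout and scoring rule are illustrative, not the paper's exact algorithm):

```python
def pick_uplink(northbound_util, southbound_util):
    """Score each candidate uplink by its worst known one-hop utilization, pick the best."""
    score = {up: max(northbound_util[up], southbound_util.get(up, 0.0))
             for up in northbound_util}
    return min(score, key=score.get)

print(pick_uplink({"to_agg1": 0.7, "to_agg2": 0.2},
                  {"to_agg1": 0.1, "to_agg2": 0.3}))
```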

Expeditus reportedly improves average flow completion time and tail latency by 25%-30% compared with ECMP and CONGA (need to read the paper for details).

HULA

HULA stores only the best next hop toward each destination in each switch, maintained by periodically sending low-cost probe packets.

The probe header contains two elements: the destination ToR id and the path utilization. Each switch that receives a probe updates the probe header and its own path utilization table, which records the destination, the best next hop, and the utilization.
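
A minimal sketch of the per-switch probe handling, assuming the utilization carried by a probe is the maximum link utilization seen along the probed path so far (the table layout and names are illustrative):

```python
best = {}   # dest ToR id -> {"next_hop": ingress port of the best probe, "util": path utilization}

def on_probe(dest_tor, probe_util, in_port, local_link_util):
    """Update the best-next-hop table and return the utilization to write back into the probe."""
    path_util = max(probe_util, local_link_util)   # bottleneck utilization along the probed path
    entry = best.get(dest_tor)
    if entry is None or path_util < entry["util"]:
        best[dest_tor] = {"next_hop": in_port, "util": path_util}
    return path_util   # the probe is forwarded onward with this updated value

on_probe(dest_tor=3, probe_util=0.4, in_port=2, local_link_util=0.6)
print(best[3])
```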

HULA achieves lower flow completion time than ECMP and CONGA for both large and small flows under different workloads. It also performs better in asymmetric scenarios, e.g. after link failures.

congestion-agnostic mechanisms